Common Game Experiments


Testing performance of new features 
Testing the tuning of different game variables 
Testing new user flows 
Testing new, better payment flows 
Step 1: Randomly split players into groups

Step 2: Apply treatments to each group

Step 3: Evaluate performance for each group against control

Step 4: Pick a winner and make the change permanent











Advantages of A/B Testing 


Simple concept to understand 
Hard to Implement



Standard Process 



Once the test has run for a few days, the winner of the experiment
is picked by comparing averages (e.g., Rev/User or
ARPDAU) between the groups, amongst other metrics

A t-test is used to determine the statistical significance of
the results
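
The comparison of averages with a t-test can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Welch's unequal-variance form of the t-test, approximates the p-value with a normal distribution (reasonable only for large samples), and uses made-up per-user revenue values.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t_test(group_a, group_b):
    """Return (t statistic, approximate two-sided p-value) for mean(b) - mean(a)."""
    mean_a, mean_b = mean(group_a), mean(group_b)
    # Standard error of the difference in means, without assuming equal variances.
    std_err = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    t_stat = (mean_b - mean_a) / std_err
    # Assumption: samples are large, so the t distribution is close to normal.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Hypothetical per-user revenue for two groups (not data from the talk).
control = [0.0] * 180 + [4.99] * 15 + [19.99] * 5
variant = [0.0] * 175 + [4.99] * 18 + [19.99] * 7
t_stat, p_value = welch_t_test(control, variant)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

Because the test works on means, a handful of very large spenders landing in one group can move both the difference and its significance.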



When the Results are Positive 


Are they positive because the new feature did well?

Or are they positive because of the players who happen to be there?


Can Signal be Overpowered by Noise Using Naive Methods?


We Ran 500 A/A Tests


A/A = No difference in the experience between the two groups


And compared the performance of the two groups
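
A rough sketch of what such an A/A exercise looks like in code. The revenue data here is synthetic (a hypothetical mix of non-payers and heavy-tailed Pareto payers), and the function name and parameters are illustrative, so the printed number will not match the figures in the talk.

```python
import random


def aa_pct_differences(revenues, trials=500, seed=0):
    """Randomly split users in half `trials` times; return sorted |% diff| in Rev/User."""
    rng = random.Random(seed)
    users = list(revenues)
    half = len(users) // 2
    diffs = []
    for _ in range(trials):
        rng.shuffle(users)
        rev_a = sum(users[:half]) / half
        rev_b = sum(users[half:]) / (len(users) - half)
        # Assumes group B has non-zero revenue; both groups get the same experience.
        diffs.append(abs(rev_a - rev_b) / rev_b * 100)
    return sorted(diffs)


# Hypothetical population: about 95% non-payers, payers drawn from a Pareto distribution.
data_rng = random.Random(42)
revenues = [data_rng.paretovariate(1.5) if data_rng.random() < 0.05 else 0.0
            for _ in range(5000)]
diffs = aa_pct_differences(revenues)
# The 80th percentile is the difference exceeded in roughly 1 out of 5 trials.
print(f"1 in 5 A/A trials differ by more than {diffs[int(0.8 * len(diffs))]:.1f}%")
```

Lowering the Pareto shape parameter makes the synthetic game more skewed, which is one way to explore how skew widens the spread between identical groups.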
500 Random A/A Trials Comparing Rev/User

[Chart: x-axis is Trial Number Sorted by % difference; series: Least Skewed Game, Medium Skewed Game, Highly Skewed Game]

1 out of 5 times, there is a difference of > 1.4% in Rev/User between the two groups in a game that is not significantly skewed

1 out of 5 times, there is a difference of > 3.2% in Rev/User between the two groups in a game that is very skewed
Hypothetically, let's say all payers spent the
same amount of money in the game
Percent Cumulative Revenue by Payer Percentile

[Chart: series "All Payers pay equally"]


In Reality...


Percent Cumulative Revenue by Payer Percentile

[Chart: series Least Skewed Game, Medium Skewed Game, Highly Skewed Game, "All Payers pay equally"]



Revenue Split Between Payers

[Pie charts: Least Skewed Game, Medium Skewed Game, Most Skewed Game; slices: Top 1%, Top 2% - 10%, Bottom 90%; recoverable slice labels: 14%, 19%]
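
The three-way split shown in these charts can be computed from a list of per-payer revenue. The sketch below is illustrative only: the function name is hypothetical, the bucket boundaries use simple integer division, and the sample data is invented rather than taken from any of the games discussed.

```python
def revenue_split(payer_revenues):
    """Share of total revenue from the Top 1%, Top 2% - 10%, and Bottom 90% of payers."""
    ranked = sorted(payer_revenues, reverse=True)
    total = sum(ranked)
    top1 = max(1, len(ranked) // 100)      # number of payers in the top 1%
    top10 = max(top1, len(ranked) // 10)   # number of payers in the top 10%
    return {
        "Top 1%": sum(ranked[:top1]) / total,
        "Top 2% - 10%": sum(ranked[top1:top10]) / total,
        "Bottom 90%": sum(ranked[top10:]) / total,
    }


# Hypothetical payers: a few very large spenders and many small ones.
payers = [5000.0] * 2 + [300.0] * 18 + [10.0] * 180
for bucket, share in revenue_split(payers).items():
    print(f"{bucket}: {share:.0%}")
```

If every payer spent the same amount, the three shares would simply be 1%, 9%, and 90%.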
Most Games Have a Non-Normal Distribution

Unequal Split of Top Spenders Can Cause Bias in the Split of Users
Results that look great might, in reality, be underperforming variations
Variant A Might Look to be Performing “Better” Than B

[Chart: x-axis Day 1 to Day 10, with the Experiment Start Date marked]

But in Reality it Might be Performing Worse
Goal


Fast decision making that allows rapid iteration 


Hide the complexity of understanding distributions 


Empower product managers to run more experiments 
Challenge


Data Scientists can’t analyze every experiment 


Decision makers don’t have infinite time to make a decision 
Dual Control

[Chart: daily values from 1/1/2016 to 1/15/2016; y-axis from 0 to 0.16]



Methodology

Naive: Highly Prone to Top Spender Imbalance
Dual Control: Only Provides a Visual Representation of Natural Variance Between Controls












Mann-Whitney U

Rank   Revenue   Variant
1      $170.0    B
2      $133.0    A
3      $129.0    A
4      $110.0    A
5      $90.0     B
6      $88.0     B
7      $75.0     A
8      $66.0     A
9      $65.0     A
10     $60.0     B
11     $59.0     B
12     $58.0     B
13     $55.0     B
14     $50.0     A
15     $48.0     A
16     $46.0     B

Sum of B Rank: 74
Sum of A Rank: 62
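
The rank sums above can be reproduced with a few lines of code. This sketch follows the table's convention that rank 1 is the highest revenue, does not handle ties (the table has none), and adds the standard U statistic (rank sum minus n(n+1)/2), which the slide itself does not show.

```python
# Revenue values per variant, copied from the ranking table.
revenue_a = [133.0, 129.0, 110.0, 75.0, 66.0, 65.0, 50.0, 48.0]
revenue_b = [170.0, 90.0, 88.0, 60.0, 59.0, 58.0, 55.0, 46.0]


def rank_sums(group_a, group_b):
    """Pool both groups, rank from highest to lowest revenue, and sum ranks per variant."""
    pooled = sorted(
        [(value, "A") for value in group_a] + [(value, "B") for value in group_b],
        reverse=True,
    )
    sums = {"A": 0, "B": 0}
    for rank, (_, variant) in enumerate(pooled, start=1):
        sums[variant] += rank
    return sums


def u_statistic(rank_sum, n):
    """U for one group: its rank sum minus the smallest rank sum it could have."""
    return rank_sum - n * (n + 1) // 2


sums = rank_sums(revenue_a, revenue_b)
u_a = u_statistic(sums["A"], len(revenue_a))
u_b = u_statistic(sums["B"], len(revenue_b))
print(sums, u_a, u_b)  # {'A': 62, 'B': 74} 26 38
```

The two U values always add up to the product of the group sizes (here 8 x 8 = 64). Because only ranks are used, a single $170 payer counts no more than any other top-ranked user, which is also why the test says nothing about the size of a revenue change.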




Methodology

Naive: Highly Prone to Top Spender Imbalance
Dual Control: Only Provides a Visual Representation of Natural Variance Between Controls
Mann-Whitney U: Statistical Sledgehammer That Gives Perfect Results but Doesn’t Tell the Magnitude of the Change





Methodology

Naive: Highly Prone to Top Spender Imbalance
Dual Control: Only Provides a Visual Representation of Natural Variance Between Controls
Mann-Whitney U: Statistical Sledgehammer That Gives Perfect Results but Doesn’t Tell the Magnitude of the Change
Pre-Post
Pre - Post


Compare the performance of a group of users before and
after the test, i.e., compare
sum(pre-test values)/count(users) to
sum(post-test values)/count(users)
Pre - Post

          Pre-Test Values   Post-Test Values   Difference
Control   X1                Y1                 Z1 = (Y1 - X1) / X1
Test      X2                Y2                 Z2 = (Y2 - X2) / X2
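
The Z1 and Z2 columns can be computed as below. The per-user values are hypothetical, and the slide does not say how Z1 and Z2 are combined into a final verdict, so this sketch only reports them side by side.

```python
def pre_post_change(pre_values, post_values):
    """Relative change in the per-user average: (Y - X) / X."""
    users = len(pre_values)
    pre_avg = sum(pre_values) / users    # X: sum(pre-test values) / count(users)
    post_avg = sum(post_values) / users  # Y: sum(post-test values) / count(users)
    return (post_avg - pre_avg) / pre_avg


# Hypothetical revenue per user, same users before and after the test started.
control_pre, control_post = [0.0, 2.0, 10.0, 48.0], [0.0, 3.0, 9.0, 51.0]
test_pre, test_post = [0.0, 1.0, 12.0, 47.0], [1.0, 2.0, 14.0, 49.0]

z1 = pre_post_change(control_pre, control_post)
z2 = pre_post_change(test_pre, test_post)
print(f"Z1 (Control) = {z1:.1%}, Z2 (Test) = {z2:.1%}")
```

Each group is measured against its own history, so a group that happened to receive more big spenders starts from a correspondingly higher baseline. The calculation needs a non-zero pre-test average for the group.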






Pre - Post

           Average   10th      25th      Median    75th      90th
Normal     1.15%     0.18%     0.44%     0.95%     1.63%     2.44%
Pre-Post   0.86%     0.15%     0.34%     0.73%     1.26%     1.77%
% Gain     -25.01%   -15.99%   -21.74%   -22.40%   -22.71%   -27.54%

Pre-post reduced the average noise to 0.86%
(25.01% less)





Methodology

Naive: Highly Prone to Whale Imbalance
Dual Control: Only Provides a Visual Representation of Natural Variance Between Controls
Mann-Whitney U: Statistical Sledgehammer that Gives Perfect Results but Doesn’t Tell the Magnitude of the Change
Pre-Post: Doesn’t Account for Non-Payers and New Installs





Methodology

Naive: Highly Prone to Whale Imbalance
Dual Control: Only Provides a Visual Representation of Natural Variance Between Controls
Mann-Whitney U: Statistical Sledgehammer that Gives Perfect Results but Doesn’t Tell the Magnitude of the Change
Pre-Post: Doesn’t Account for Non-Payers and New Installs
Neighborhood Band Normalization





Neighborhood Band Normalization


Performance of the Variant = Sum of Actual Results / Sum of Estimated Results



Neighborhood Band Normalization


Rank users based on prior-to-test features
• Prior rev, prior game actions, prior engagement, geo

Post Rev estimation = Average post rev of the 100 users ranked
above and below them (w/ adjustment factors for those who don’t
have 100 above or 100 below)


Makes estimations for non-payers and installs
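
One possible reading of this procedure, sketched under several assumptions that the slides do not confirm: users are ranked by a single prior score (for example prior revenue), neighbors are taken from the control group, and the band is simply truncated at the ends of the ranking in place of the unspecified adjustment factors. The function name and data are hypothetical.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def band_normalized_performance(variant, control, band=100):
    """Sum of actual post revenue / sum of estimated post revenue for a variant.

    `variant` and `control` are lists of (prior_score, post_revenue) tuples.
    """
    ranked = sorted(control)  # rank control users by prior score
    scores = [score for score, _ in ranked]
    # Prefix sums make the average over any band of neighbors a constant-time lookup.
    prefix = [0.0] + list(accumulate(post for _, post in ranked))
    actual = estimated = 0.0
    for prior, post in variant:
        pos = bisect_left(scores, prior)
        lo = max(0, pos - band)             # up to `band` users ranked below
        hi = min(len(ranked), pos + band)   # up to `band` users ranked above
        estimated += (prefix[hi] - prefix[lo]) / (hi - lo)
        actual += post
    return actual / estimated


# Hypothetical users: post revenue loosely follows prior revenue.
rng = random.Random(7)
control = [(p, p * rng.uniform(0.5, 1.5)) for p in (rng.expovariate(0.1) for _ in range(2000))]
variant = [(p, p * rng.uniform(0.5, 1.5)) for p in (rng.expovariate(0.1) for _ in range(2000))]
print(f"Performance of the variant: {band_normalized_performance(variant, control):.3f}")
```

A ratio near 1 means the variant earned about what users with a similar history earned elsewhere. Non-payers and new installs still get an estimate because they are ranked alongside everyone else, typically near the bottom of the prior-score ordering.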



Neighborhood Band Normalization 







500 Random A/A Trials for Highly Skewed Games

[Chart: series ARPDAU, Neighborhood Band Normalization]

Normalizing Actual Results by Predictive Results Reduces the Noise by 31%
It is Important to Take Prior Information Into Account



3 variations of a feature that
grants different rewards showed
the following result based on Rev/User

% Difference from Control

              Naive
Variation 1   7.33%
Variation 2   4.62%
Variation 3   2.23%

The group that was exposed to Variation 1
had 40% more Top 1% payers than Control
3 variations of a feature that
grants different rewards showed
the following result based on Rev/User

% Difference from Control

              Naive    Neighborhood Band Normalization
Variation 1   7.33%    -7.24%
Variation 2   4.62%    -7.17%
Variation 3   2.23%    11.65%
Methodology

Naive: Highly Prone to Whale Imbalance
Dual Control: Only Provides a Visual Representation of Natural Variance Between Controls
Mann-Whitney U: Statistical Sledgehammer that Gives Perfect Results but Doesn’t Tell the Magnitude of the Change
Pre-Post: Doesn’t Account for Non-Payers and New Installs
Neighborhood Band Normalization: It’s Better on Average But Not Always





Always Room for a Better Methodology


Continued investment into Bayesian methods
to find a more robust approach that can withstand any
distribution found in games
Common Pitfalls


Setting up the experiment correctly

Testing things that are actually meaningful 

Too many experiments going on 

Not analyzing the right metrics 

Not understanding how your top payers behave 



Games Are Pieces of Art as Much as Science 


• Sometimes testing is only good for getting a
directional sense

• Don’t let data govern you 

• Trust your intuition 

• Common sense over data 
Thank You


Questions?

Henry Phillips : hphillips@zynga.com
Anshul Dhawan : adhawan@zynga.com | @theanshuldhawan 


